-
Notifications
You must be signed in to change notification settings - Fork 44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
core: fix unrecoverable freezes of rabbit's consumer #10594
base: dev
Are you sure you want to change the base?
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## dev #10594 +/- ##
==========================================
- Coverage 81.93% 81.93% -0.01%
==========================================
Files 1079 1079
Lines 107380 107376 -4
Branches 737 737
==========================================
- Hits 87984 87978 -6
- Misses 19356 19358 +2
Partials 40 40
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
@@ -316,7 +320,7 @@ class WorkerCommand : CliCommand { | |||
if (!channel.isOpen()) break | |||
} | |||
|
|||
return 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How to this work out with multithreading?
I don't quite understand when this line is reached actually. It is when all threads have died, or just one?
Maybe some time we should rewrite this class to lower the amount of nested callbacks and functions. It's a little difficult to follow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, not understanding it all either.
After quite a few hand-tests and trying various things I landed a different proposition: please take a look at the (updated) main comment of the PR (which may be updated again later - will notify).
06726c5
to
ee9d8aa
Compare
b9647fc
to
c8bd86f
Compare
54ad82d
to
17f2e86
Compare
Hard fix (kind of) to kill the process (and let orchestrator restart) if: * rabbit shuts down (triggering consumer's ShutdownCallback) * or an exception before starting basicConsume() goes up to main() (even if other threads may run) Bump amqp-client on the way as it doesn't hurt Signed-off-by: Pierre-Etienne Bougué <[email protected]>
From hand-tests, shutdown is already covered by the System.exit in App.java::main(). Signed-off-by: Pierre-Etienne Bougué <[email protected]>
…llback Hard fix (kind of) to kill the process (and let orchestrator restart) if an exception goes all the way up to the DeliverCallback. For example when not able to reach editoast for infra reload. This will release unacked messages and move them back to ready (instead of keeping them unacked until the worker exits). Signed-off-by: Pierre-Etienne Bougué <[email protected]>
3ecda5e
to
e6c5f8c
Compare
A third commit was pushed, and the main comment updated (and all rebased on No more work is planned on this, please read the main comment, any feedback is welcome, and we should be good to go 🙏 |
There is a bit more work, actually (for me): test if it improves the case described in #10704 |
Hard fix (kind of) to kill the process if a thread's exception goes up to
main()
, or if rabbit's cancel/shutdown notification is received (even if other threads may run) and let orchestrator restart it.Bump amqp-client on the way as it doesn't hurt.
Fixes #8621
Also reproduced and fixed the case that leads to the following (different) core logs:
Hand-tested
core-req-all
queue to initiate cancel notification: ✅core logs "consumer cancelled" then stops (should cover Core message consumer fails and blocks the scenario #8621).Then stop editoast and start single-worker core (so impossible to load infra inside DeliverCallback): ✔️❓Exceptions in threads when trying to load infra for pending requests, core stays alive (leaving pending requests unacked until core is stopped - after step below).DONE in last commit.➡️We can try/catch in
callback
function andexitProcess
to force shutdown.Then start editoast (keep core as-is) and initiate some new core request (using front): ✅core "correctly" processing only new requests.Stop core: ✅Unacked requests are back to ready.Start core: ✅Ready requests are processed.Understanding of previous and current work
Previous:
Current:
return
) and it may be an idea to use shutdown notification to exit cleanlyLooks like some improvements may be done (to be explored later, sorted by ROI)
isRecovarable
, cleanup consumer logs, properly close channels and connections)